Fifa 19 Transfer Market Value Analysis


Introduction

The FIFA series is a collection of footbal simulation video games developed by EA Sports. Each year, a new game is released for millions of fans around the world.
The game is well known for its close resemblance to real life football, adding many features that mimic the professional sport. One of these features is including transfer market value information for each player. This value represents how much clubs would have to pay in order to aquire a certain player.

Objectives

The goal of this project is to gather and analyze FIFA player data then predict the transfer market value tier of each player through machine learning.

Summary

In this project we will be following the machine learning workflow to tackle this investigation. Below are the highlights of each step.

Cells

Section

Highlights

1-44 Data Engineering
  • Psycopg2 and DDL to define + create a local PostgreSQL database table and ingest raw data.
  • SQL to query the data from Postgres table
  • Pandas and NumPy for significant data pre-processing
45-50 Exploratory Data Analysis
  • Matplotlib, Seaborne and Plotly libraries for data visualization
  • Insight gathering through different plots
  • Outlier detection
51-79 Feature Engineering
  • Exploring and determining transfer market tier levels
  • Converting binary categorical fields into numerical fields
  • Handling under-representation of _'high market value'_ tier using SMOTE algorithm
  • Correlation Analysis and Recursive Feature Elimination(RFE) for feature selection
80-82 Data Modeling
  • SciKit-Learn for supervised learning logistic regression model
83-85 Model Evaluation and Performance
  • Accuracy, Precision and Recall score computation and analysis
  • Confusion Matrix analysis
  • ROC Curve Plot and AUC value computation for false positive rate vs true positive rate analysis


Data Engineering and Pre-Processing:


Creating and defining table on PostgreSQL database and ingesting the data

In [1]:
import psycopg2
In [2]:
conn = psycopg2.connect("host=localhost dbname=postgres user=postgres password=padrepio1410")
cur = conn.cursor()
In [3]:
# cur.execute("""
# DROP TABLE IF EXISTS fifa19;
# CREATE TABLE fifa19 (
# index integer,
# id integer,
# name text,
# age integer,
# photo text,
# nationality text,
# flag text,
# overall integer,
# potential integer,
# club text,
# club_logo text,
# mkt_value text,
# wage text,
# special integer,
# preferred_foot text,
# international_rep integer,
# weak_foot integer,
# skill_moves integer,
# work_rate text,
# body_type text,
# real_face text,
# position text,
# jersey_number integer,
# joined text,
# loaned_from text,
# contract_valid_until text,
# height text,
# weight text,
# ls text,
# st text,
# rs text,
# lw text,
# lf text,
# cf text,
# rf text,
# rw text,
# lam text,
# cam text,
# ram text,
# lm text,
# lcm text,
# cm text,
# rcm text,
# rm text,
# lwb text,
# ldm text,
# cdm text,
# rdm text,
# rwb text,
# lb text,
# lcb text,
# cb text,
# rcb text,
# rb text,
# crossing integer,
# finishing integer,
# headingaccuracy integer,
# shortpassing integer,
# volleys integer,
# dribbling integer,
# curve integer,
# fkaccuracy integer,
# longpassing integer,
# ballcontrol integer,
# acceleration integer,
# sprintspeed integer,
# agility integer,
# reactions integer,
# balance integer,
# shotpower integer,
# jumping integer,
# stamina integer,
# strength integer,
# longshots integer,
# aggression integer,
# interceptions integer,
# positioning integer,
# vision integer,
# penalties integer,
# composure integer,
# marking integer,
# standingtackle integer,
# slidingtackle integer,
# gkdiving integer,
# gkhandling integer,
# gkkicking integer,
# gkpositioning integer,
# gkreflexes integer,
# release_clause text);
# """)
# conn.commit()


# #data ingestion 

# # with open('./fifa19.csv',"r") as f:
# #     cur.copy_from(f,'fifa19',sep=',',null="")
# #     conn.commit()

Data

First, let's take a look at the data


In [4]:
# basic operations
import numpy as np
import pandas as pd 
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')

Querying data from PostgreSQL database

In [5]:
sql_command = 'SELECT * FROM fifa19'
df = pd.read_sql(sql_command, conn)
conn.close ()
In [6]:
df.head(10)
Out[6]:
index id name age photo nationality flag overall potential club club_logo mkt_value wage special preferred_foot international_rep weak_foot skill_moves work_rate body_type real_face position jersey_number joined loaned_from contract_valid_until height weight ls st rs lw lf cf rf rw lam cam ram lm lcm cm rcm rm lwb ldm cdm rdm rwb lb lcb cb rcb rb crossing finishing headingaccuracy shortpassing volleys dribbling curve fkaccuracy longpassing ballcontrol acceleration sprintspeed agility reactions balance shotpower jumping stamina strength longshots aggression interceptions positioning vision penalties composure marking standingtackle slidingtackle gkdiving gkhandling gkkicking gkpositioning gkreflexes release_clause
0 0 158023 L. Messi 31 https://cdn.sofifa.org/players/4/19/158023.png Argentina https://cdn.sofifa.org/flags/52.png 94 94 FC Barcelona https://cdn.sofifa.org/teams/2/light/241.png €110.5M €565K 2202 Left 5.0 4.0 4.0 Medium/ Medium Messi Yes RF 10.0 01-Jul-04 None 2021 5'7 159lbs 88+2 88+2 88+2 92+2 93+2 93+2 93+2 92+2 93+2 93+2 93+2 91+2 84+2 84+2 84+2 91+2 64+2 61+2 61+2 61+2 64+2 59+2 47+2 47+2 47+2 59+2 84.0 95.0 70.0 90.0 86.0 97.0 93.0 94.0 87.0 96.0 91.0 86.0 91.0 95.0 95.0 85.0 68.0 72.0 59.0 94.0 48.0 22.0 94.0 94.0 75.0 96.0 33.0 28.0 26.0 6.0 11.0 15.0 14.0 8.0 €226.5M
1 1 20801 Cristiano Ronaldo 33 https://cdn.sofifa.org/players/4/19/20801.png Portugal https://cdn.sofifa.org/flags/38.png 94 94 Juventus https://cdn.sofifa.org/teams/2/light/45.png €77M €405K 2228 Right 5.0 4.0 5.0 High/ Low C. Ronaldo Yes ST 7.0 10-Jul-18 None 2022 6'2 183lbs 91+3 91+3 91+3 89+3 90+3 90+3 90+3 89+3 88+3 88+3 88+3 88+3 81+3 81+3 81+3 88+3 65+3 61+3 61+3 61+3 65+3 61+3 53+3 53+3 53+3 61+3 84.0 94.0 89.0 81.0 87.0 88.0 81.0 76.0 77.0 94.0 89.0 91.0 87.0 96.0 70.0 95.0 95.0 88.0 79.0 93.0 63.0 29.0 95.0 82.0 85.0 95.0 28.0 31.0 23.0 7.0 11.0 15.0 14.0 11.0 €127.1M
2 2 190871 Neymar Jr 26 https://cdn.sofifa.org/players/4/19/190871.png Brazil https://cdn.sofifa.org/flags/54.png 92 93 Paris Saint-Germain https://cdn.sofifa.org/teams/2/light/73.png €118.5M €290K 2143 Right 5.0 5.0 5.0 High/ Medium Neymar Yes LW 10.0 03-Aug-17 None 2022 5'9 150lbs 84+3 84+3 84+3 89+3 89+3 89+3 89+3 89+3 89+3 89+3 89+3 88+3 81+3 81+3 81+3 88+3 65+3 60+3 60+3 60+3 65+3 60+3 47+3 47+3 47+3 60+3 79.0 87.0 62.0 84.0 84.0 96.0 88.0 87.0 78.0 95.0 94.0 90.0 96.0 94.0 84.0 80.0 61.0 81.0 49.0 82.0 56.0 36.0 89.0 87.0 81.0 94.0 27.0 24.0 33.0 9.0 9.0 15.0 15.0 11.0 €228.1M
3 3 193080 De Gea 27 https://cdn.sofifa.org/players/4/19/193080.png Spain https://cdn.sofifa.org/flags/45.png 91 93 Manchester United https://cdn.sofifa.org/teams/2/light/11.png €72M €260K 1471 Right 4.0 3.0 1.0 Medium/ Medium Lean Yes GK 1.0 01-Jul-11 None 2020 6'4 168lbs None None None None None None None None None None None None None None None None None None None None None None None None None None 17.0 13.0 21.0 50.0 13.0 18.0 21.0 19.0 51.0 42.0 57.0 58.0 60.0 90.0 43.0 31.0 67.0 43.0 64.0 12.0 38.0 30.0 12.0 68.0 40.0 68.0 15.0 21.0 13.0 90.0 85.0 87.0 88.0 94.0 €138.6M
4 4 192985 K. De Bruyne 27 https://cdn.sofifa.org/players/4/19/192985.png Belgium https://cdn.sofifa.org/flags/7.png 91 92 Manchester City https://cdn.sofifa.org/teams/2/light/10.png €102M €355K 2281 Right 4.0 5.0 4.0 High/ High Normal Yes RCM 7.0 30-Aug-15 None 2023 5'11 154lbs 82+3 82+3 82+3 87+3 87+3 87+3 87+3 87+3 88+3 88+3 88+3 88+3 87+3 87+3 87+3 88+3 77+3 77+3 77+3 77+3 77+3 73+3 66+3 66+3 66+3 73+3 93.0 82.0 55.0 92.0 82.0 86.0 85.0 83.0 91.0 91.0 78.0 76.0 79.0 91.0 77.0 91.0 63.0 90.0 75.0 91.0 76.0 61.0 87.0 94.0 79.0 88.0 68.0 58.0 51.0 15.0 13.0 5.0 10.0 13.0 €196.4M
5 5 183277 E. Hazard 27 https://cdn.sofifa.org/players/4/19/183277.png Belgium https://cdn.sofifa.org/flags/7.png 91 91 Chelsea https://cdn.sofifa.org/teams/2/light/5.png €93M €340K 2142 Right 4.0 4.0 4.0 High/ Medium Normal Yes LF 10.0 01-Jul-12 None 2020 5'8 163lbs 83+3 83+3 83+3 89+3 88+3 88+3 88+3 89+3 89+3 89+3 89+3 89+3 82+3 82+3 82+3 89+3 66+3 63+3 63+3 63+3 66+3 60+3 49+3 49+3 49+3 60+3 81.0 84.0 61.0 89.0 80.0 95.0 83.0 79.0 83.0 94.0 94.0 88.0 95.0 90.0 94.0 82.0 56.0 83.0 66.0 80.0 54.0 41.0 87.0 89.0 86.0 91.0 34.0 27.0 22.0 11.0 12.0 6.0 8.0 8.0 €172.1M
6 6 177003 L. Modri? 32 https://cdn.sofifa.org/players/4/19/177003.png Croatia https://cdn.sofifa.org/flags/10.png 91 91 Real Madrid https://cdn.sofifa.org/teams/2/light/243.png €67M €420K 2280 Right 4.0 4.0 4.0 High/ High Lean Yes RCM 10.0 01-Aug-12 None 2020 5'8 146lbs 77+3 77+3 77+3 85+3 84+3 84+3 84+3 85+3 87+3 87+3 87+3 86+3 88+3 88+3 88+3 86+3 82+3 81+3 81+3 81+3 82+3 79+3 71+3 71+3 71+3 79+3 86.0 72.0 55.0 93.0 76.0 90.0 85.0 78.0 88.0 93.0 80.0 72.0 93.0 90.0 94.0 79.0 68.0 89.0 58.0 82.0 62.0 83.0 79.0 92.0 82.0 84.0 60.0 76.0 73.0 13.0 9.0 7.0 14.0 9.0 €137.4M
7 7 176580 L. Suárez 31 https://cdn.sofifa.org/players/4/19/176580.png Uruguay https://cdn.sofifa.org/flags/60.png 91 91 FC Barcelona https://cdn.sofifa.org/teams/2/light/241.png €80M €455K 2346 Right 5.0 4.0 3.0 High/ Medium Normal Yes RS 9.0 11-Jul-14 None 2021 6'0 190lbs 87+5 87+5 87+5 86+5 87+5 87+5 87+5 86+5 85+5 85+5 85+5 84+5 79+5 79+5 79+5 84+5 69+5 68+5 68+5 68+5 69+5 66+5 63+5 63+5 63+5 66+5 77.0 93.0 77.0 82.0 88.0 87.0 86.0 84.0 64.0 90.0 86.0 75.0 82.0 92.0 83.0 86.0 69.0 90.0 83.0 85.0 87.0 41.0 92.0 84.0 85.0 85.0 62.0 45.0 38.0 27.0 25.0 31.0 33.0 37.0 €164M
8 8 155862 Sergio Ramos 32 https://cdn.sofifa.org/players/4/19/155862.png Spain https://cdn.sofifa.org/flags/45.png 91 91 Real Madrid https://cdn.sofifa.org/teams/2/light/243.png €51M €380K 2201 Right 4.0 3.0 3.0 High/ Medium Normal Yes RCB 15.0 01-Aug-05 None 2020 6'0 181lbs 73+3 73+3 73+3 70+3 71+3 71+3 71+3 70+3 71+3 71+3 71+3 72+3 75+3 75+3 75+3 72+3 81+3 84+3 84+3 84+3 81+3 84+3 87+3 87+3 87+3 84+3 66.0 60.0 91.0 78.0 66.0 63.0 74.0 72.0 77.0 84.0 76.0 75.0 78.0 85.0 66.0 79.0 93.0 84.0 83.0 59.0 88.0 90.0 60.0 63.0 75.0 82.0 87.0 92.0 91.0 11.0 8.0 9.0 7.0 11.0 €104.6M
9 9 200389 J. Oblak 25 https://cdn.sofifa.org/players/4/19/200389.png Slovenia https://cdn.sofifa.org/flags/44.png 90 93 Atlético Madrid https://cdn.sofifa.org/teams/2/light/240.png €68M €94K 1331 Right 3.0 3.0 1.0 Medium/ Medium Normal Yes GK 1.0 16-Jul-14 None 2021 6'2 192lbs None None None None None None None None None None None None None None None None None None None None None None None None None None 13.0 11.0 15.0 29.0 13.0 12.0 13.0 14.0 26.0 16.0 43.0 60.0 67.0 86.0 49.0 22.0 76.0 41.0 78.0 12.0 34.0 19.0 11.0 70.0 11.0 70.0 27.0 12.0 18.0 86.0 92.0 78.0 88.0 89.0 €144.5M
In [7]:
df.shape
Out[7]:
(18207, 89)
In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18207 entries, 0 to 18206
Data columns (total 89 columns):
index                   18207 non-null int64
id                      18207 non-null int64
name                    18207 non-null object
age                     18207 non-null int64
photo                   18207 non-null object
nationality             18207 non-null object
flag                    18207 non-null object
overall                 18207 non-null int64
potential               18207 non-null int64
club                    17966 non-null object
club_logo               18207 non-null object
mkt_value               18207 non-null object
wage                    18207 non-null object
special                 18207 non-null int64
preferred_foot          18159 non-null object
international_rep       18159 non-null float64
weak_foot               18159 non-null float64
skill_moves             18159 non-null float64
work_rate               18159 non-null object
body_type               18159 non-null object
real_face               18159 non-null object
position                18147 non-null object
jersey_number           18147 non-null float64
joined                  16654 non-null object
loaned_from             1264 non-null object
contract_valid_until    17918 non-null object
height                  18159 non-null object
weight                  18159 non-null object
ls                      16122 non-null object
st                      16122 non-null object
rs                      16122 non-null object
lw                      16122 non-null object
lf                      16122 non-null object
cf                      16122 non-null object
rf                      16122 non-null object
rw                      16122 non-null object
lam                     16122 non-null object
cam                     16122 non-null object
ram                     16122 non-null object
lm                      16122 non-null object
lcm                     16122 non-null object
cm                      16122 non-null object
rcm                     16122 non-null object
rm                      16122 non-null object
lwb                     16122 non-null object
ldm                     16122 non-null object
cdm                     16122 non-null object
rdm                     16122 non-null object
rwb                     16122 non-null object
lb                      16122 non-null object
lcb                     16122 non-null object
cb                      16122 non-null object
rcb                     16122 non-null object
rb                      16122 non-null object
crossing                18159 non-null float64
finishing               18159 non-null float64
headingaccuracy         18159 non-null float64
shortpassing            18159 non-null float64
volleys                 18159 non-null float64
dribbling               18159 non-null float64
curve                   18159 non-null float64
fkaccuracy              18159 non-null float64
longpassing             18159 non-null float64
ballcontrol             18159 non-null float64
acceleration            18159 non-null float64
sprintspeed             18159 non-null float64
agility                 18159 non-null float64
reactions               18159 non-null float64
balance                 18159 non-null float64
shotpower               18159 non-null float64
jumping                 18159 non-null float64
stamina                 18159 non-null float64
strength                18159 non-null float64
longshots               18159 non-null float64
aggression              18159 non-null float64
interceptions           18159 non-null float64
positioning             18159 non-null float64
vision                  18159 non-null float64
penalties               18159 non-null float64
composure               18159 non-null float64
marking                 18159 non-null float64
standingtackle          18159 non-null float64
slidingtackle           18159 non-null float64
gkdiving                18159 non-null float64
gkhandling              18159 non-null float64
gkkicking               18159 non-null float64
gkpositioning           18159 non-null float64
gkreflexes              18159 non-null float64
release_clause          16643 non-null object
dtypes: float64(38), int64(6), object(45)
memory usage: 12.4+ MB

This dataset contains over 80 fields, some not applying to our analysis.

We'll drop some fields that are less meaningful.

In [9]:
df.drop(columns=['index','id','photo','flag','club_logo','special','real_face','jersey_number','contract_valid_until','release_clause'],inplace=True)

Handling Missing Values


The table seems to have many missing values. We'll have to properly handle these values before we can continue with modifying the dataset


In [10]:
# List of fields with missing values and count
df.isnull().sum().sort_values(ascending=False)
Out[10]:
loaned_from        16943
rwb                 2085
rf                  2085
lam                 2085
cam                 2085
ram                 2085
lm                  2085
lcm                 2085
cm                  2085
rcm                 2085
rm                  2085
lwb                 2085
ldm                 2085
cdm                 2085
rdm                 2085
rb                  2085
rcb                 2085
cb                  2085
lcb                 2085
rw                  2085
cf                  2085
lb                  2085
lw                  2085
rs                  2085
st                  2085
ls                  2085
lf                  2085
joined              1553
club                 241
position              60
                   ...  
penalties             48
vision                48
positioning           48
interceptions         48
aggression            48
finishing             48
strength              48
longshots             48
jumping               48
longpassing           48
headingaccuracy       48
shortpassing          48
volleys               48
shotpower             48
curve                 48
fkaccuracy            48
dribbling             48
ballcontrol           48
acceleration          48
sprintspeed           48
agility               48
reactions             48
balance               48
wage                   0
mkt_value              0
potential              0
overall                0
nationality            0
age                    0
name                   0
Length: 79, dtype: int64

We'll deal with the loaned_from column first as it is almost fully missing

A null values under this column simply means these players are not on loan, which makes sense why majority have a null value. Dropping this column makes sense, however I'm interested in seeing if loans affect player wage. So we'll simply change this column to boolean values

In [11]:
df['on_loan'] = (df['loaned_from'].notnull()).astype(int)
df['on_loan'].unique()
Out[11]:
array([0, 1], dtype=int64)
In [12]:
#dropping df['on_loan'] as it is not needed anymore
df.drop(columns=['loaned_from'],inplace=True)

Next, we'll deal with the club column and fill missing values with 'No Club'


In [13]:
df['club'].fillna(value='No Club',inplace=True)

There appears to be many players with 'No Club', I assume that these player will have little value and no wage. If this is the case these players should be dropped as they will skew the results.

Let's investigate further


In [14]:
len(df['club'][df['club'] == 'No Club'])
Out[14]:
241
In [15]:
df_noClub = df[(df['wage'] == '€ 0') & (df['club']=='No Club')][['name','club','mkt_value','wage']]
df_noClub.head()
Out[15]:
name club mkt_value wage
453 L. Paredes No Club € 0 € 0
540 A. Granqvist No Club € 0 € 0
570 A. Lunev No Club € 0 € 0
678 I. Smolnikov No Club € 0 € 0
875 A. Dzyuba No Club € 0 € 0
In [16]:
print('Number of players with no wage and club: ',len(df_noClub))
Number of players with no wage and club:  241

The results align with our hypothesis, showing that all players with no club indeed have low transfer market value and no wage

These players only make up a small portion of the dataset so removing them makes sense.

In [17]:
##Dropping these players after analysis
drop_list = drop_list=df[(df['wage']=='€ 0') & (df['club']=='No Club')].index
df.drop(index=drop_list,inplace=True)

Next, to deal with the joined column we'll we'll create a new column named 'months_at_club' which stores the number of months a player has been at that club for. For missing values we'll assign a value of 0.


In [18]:
from datetime import datetime

def num_months(df_value):
    try:
        year = datetime.strptime(df_value,"%d-%b-%y")
        delta = datetime.now()-year
        return divmod(delta.days,30.436875)[0]
    except (ValueError, TypeError):
        value = 0
    return value

df['months_at_club'] = df['joined'].apply(num_months)
In [19]:
## Dropping df['joined'] as it not needed anymore
df.drop(columns=['joined'],inplace=True)
df[['name','club','months_at_club']].head()
Out[19]:
name club months_at_club
0 L. Messi FC Barcelona 183.0
1 Cristiano Ronaldo Juventus 15.0
2 Neymar Jr Paris Saint-Germain 26.0
3 De Gea Manchester United 99.0
4 K. De Bruyne Manchester City 49.0

Remaining null values listed below

In [20]:
df.isnull().sum()[df.isnull().sum() > 0].sort_values(ascending=False)
Out[20]:
rb                 2040
rm                 2040
cam                2040
ram                2040
lm                 2040
lcm                2040
cm                 2040
rcm                2040
lwb                2040
rcb                2040
ldm                2040
cdm                2040
rdm                2040
rwb                2040
lb                 2040
lcb                2040
lam                2040
rw                 2040
rf                 2040
cf                 2040
lf                 2040
lw                 2040
rs                 2040
st                 2040
ls                 2040
cb                 2040
weight               48
height               48
position             48
body_type            48
                   ... 
aggression           48
interceptions        48
positioning          48
vision               48
composure            48
crossing             48
marking              48
standingtackle       48
slidingtackle        48
gkdiving             48
gkhandling           48
gkkicking            48
stamina              48
jumping              48
shotpower            48
balance              48
reactions            48
agility              48
sprintspeed          48
acceleration         48
ballcontrol          48
longpassing          48
fkaccuracy           48
curve                48
dribbling            48
volleys              48
shortpassing         48
headingaccuracy      48
finishing            48
preferred_foot       48
Length: 69, dtype: int64

Looking at the remaining list, majority of the columns are player attributes (sprintspeed, dribbling, volleys etc.) and have the same number of missing values

Let's explore this by filtering for missing attribute values only


In [21]:
## Acceleration was randomly chosen from attributes to analyze
df_null = df[(df['acceleration'].isnull())]
df_null.head()
Out[21]:
name age nationality overall potential club mkt_value wage preferred_foot international_rep weak_foot skill_moves work_rate body_type position height weight ls st rs lw lf cf rf rw lam cam ram lm lcm cm rcm rm lwb ldm cdm rdm rwb lb lcb cb rcb rb crossing finishing headingaccuracy shortpassing volleys dribbling curve fkaccuracy longpassing ballcontrol acceleration sprintspeed agility reactions balance shotpower jumping stamina strength longshots aggression interceptions positioning vision penalties composure marking standingtackle slidingtackle gkdiving gkhandling gkkicking gkpositioning gkreflexes on_loan months_at_club
8804 A. Al Malki 23 Saudi Arabia 62 67 Al Wehda €350K €3K None NaN NaN NaN None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 0.0
8818 E. Guerrero 27 Chile 62 65 CD Palestino €300K €1K None NaN NaN NaN None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 0.0
8832 Hernáiz 20 Spain 62 69 Albacete BP €425K €1K None NaN NaN NaN None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 0.0
8846 H. Al Mansour 25 Saudi Arabia 62 64 Al Nassr €300K €5K None NaN NaN NaN None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 0.0
8860 H. Paul 24 Germany 62 66 TSV 1860 München €325K €1K None NaN NaN NaN None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 0.0
In [22]:
columns = set(df.columns) #list of columns
non_null_list = df_null.dropna(axis=1, how='all').columns #list of non null columns in above df
non_null_count = len(non_null_list)
total_column_count = len(df.columns)
null_count = total_column_count - non_null_count
remaining = [x for x in columns if x not in non_null_list]

print('Number of records: {}'.format(len(df_null)))

print('List of fields that only contain null entries in the above df:',"\n")
print(remaining,"\n")

print("Number of null fields: {} of {}".format(null_count,total_column_count))
print("Percentage: {:.2%}".format(null_count/total_column_count))
Number of records: 48
List of fields that only contain null entries in the above df: 

['rdm', 'acceleration', 'penalties', 'reactions', 'lf', 'ram', 'gkhandling', 'crossing', 'ldm', 'slidingtackle', 'skill_moves', 'composure', 'st', 'cam', 'volleys', 'fkaccuracy', 'cb', 'lwb', 'headingaccuracy', 'rcb', 'interceptions', 'shotpower', 'positioning', 'standingtackle', 'rcm', 'height', 'rf', 'body_type', 'preferred_foot', 'international_rep', 'weight', 'marking', 'finishing', 'curve', 'rb', 'rm', 'lam', 'vision', 'agility', 'gkreflexes', 'lcm', 'stamina', 'lcb', 'ballcontrol', 'dribbling', 'lm', 'cdm', 'rwb', 'strength', 'work_rate', 'cf', 'ls', 'rw', 'balance', 'cm', 'longpassing', 'weak_foot', 'gkkicking', 'lb', 'jumping', 'shortpassing', 'longshots', 'lw', 'aggression', 'rs', 'sprintspeed', 'gkpositioning', 'position', 'gkdiving'] 

Number of null fields: 69 of 79
Percentage: 87.34%

Interesting, over 85% of total fields in the filtered table have no values in them.

These players will not help with our analysis as they do not have enough information. In total there are only 48 players, therefore dropping these rows will not have a large effect, so we'll go ahead and remove them from the table.


In [23]:
drop_list=df[df['acceleration'].isnull()].index
df.drop(index=drop_list,inplace=True)

Remaining null values listed below

In [24]:
df.isnull().sum()[df.isnull().sum() > 0].sort_values(ascending=False)
Out[24]:
rb     1992
rcb    1992
st     1992
rs     1992
lw     1992
lf     1992
cf     1992
rf     1992
rw     1992
lam    1992
cam    1992
ram    1992
lm     1992
lcm    1992
cm     1992
rcm    1992
rm     1992
lwb    1992
ldm    1992
cdm    1992
rdm    1992
rwb    1992
lb     1992
lcb    1992
cb     1992
ls     1992
dtype: int64

The rest of the null values are only from positions other than GK.

I assume these players are all Goalkeepers but we'll check to be sure.


In [25]:
df_gk = df[df['position']=='GK']
df_gk.sample(10).iloc[:,14:43]
Out[25]:
position height weight ls st rs lw lf cf rf rw lam cam ram lm lcm cm rcm rm lwb ldm cdm rdm rwb lb lcb cb rcb rb
14587 GK 6'2 168lbs None None None None None None None None None None None None None None None None None None None None None None None None None None
11587 GK 6'4 194lbs None None None None None None None None None None None None None None None None None None None None None None None None None None
10080 GK 6'0 163lbs None None None None None None None None None None None None None None None None None None None None None None None None None None
12730 GK 6'2 181lbs None None None None None None None None None None None None None None None None None None None None None None None None None None
14728 GK 6'1 185lbs None None None None None None None None None None None None None None None None None None None None None None None None None None
10515 GK 5'11 152lbs None None None None None None None None None None None None None None None None None None None None None None None None None None
17357 GK 6'3 190lbs None None None None None None None None None None None None None None None None None None None None None None None None None None
11160 GK 6'3 183lbs None None None None None None None None None None None None None None None None None None None None None None None None None None
13333 GK 6'4 181lbs None None None None None None None None None None None None None None None None None None None None None None None None None None
16923 GK 6'1 190lbs None None None None None None None None None None None None None None None None None None None None None None None None None None
In [26]:
len(df_gk)
Out[26]:
1992

This confirms our hypothesis. Let's fill the remaining nulls with a value of 0


In [27]:
df.fillna(value=0,inplace = True);
In [28]:
df.isnull().sum().sum()
Out[28]:
0

Our dataset now does not contain any missing values!

We can now proceed with data cleaning and manipulation




Converting appropriate categorical features to numerical features


Currency values are listed in Euros and not in standard notation, making them hard to work with.

Let's write some functions to convert these values.


In [29]:
df[['mkt_value','wage']].head()
Out[29]:
mkt_value wage
0 €110.5M €565K
1 €77M €405K
2 €118.5M €290K
3 €72M €260K
4 €102M €355K
In [30]:
def convert(df_value):
    try:
        value = float(df_value[1:-1])
        suffix = df_value[-1:]
        conversion = 1

        if suffix == 'M':
            value = value * 1000000 * conversion
        elif suffix == 'K':
            value = value * 1000 * conversion
    except (ValueError, TypeError):
        value = 0
    return value

df['mkt_value'] = df['mkt_value'].apply(convert)
df['wage'] = df['wage'].apply(convert)
In [31]:
#result
df[['mkt_value','wage']].head()
Out[31]:
mkt_value wage
0 110500000.0 565000.0
1 77000000.0 405000.0
2 118500000.0 290000.0
3 72000000.0 260000.0
4 102000000.0 355000.0


Similarly, weight and height are listed in lbs and feet/inches respectively.

Let's convert those to numerical values.


In [32]:
df[['weight','height']].head()
Out[32]:
weight height
0 159lbs 5'7
1 183lbs 6'2
2 150lbs 5'9
3 168lbs 6'4
4 154lbs 5'11
In [33]:
def heightConverter(df_value):
    feet = df_value.split("'")[0]
    inches = df_value.split("'")[1]
    height = (int(feet) * 30.48) + (int(inches)*2.54)
    return height

def weightConverter(df_value):
    weight = int(df_value[:-3])
    return weight

df['height(cm)'] = df['height'].apply(heightConverter)
df['weight(lbs)'] = df['weight'].apply(weightConverter)
df.drop(columns=['height','weight'],inplace=True)
In [34]:
df[['height(cm)','weight(lbs)']].head()
Out[34]:
height(cm) weight(lbs)
0 170.18 159
1 187.96 183
2 175.26 150
3 193.04 168
4 180.34 154


Next, we'll convert the position ratings to numerical values as well and group them into attacking and defensive ratings


In [35]:
df[['ls', 'st', 'rs', 'lw', 'lf', 'cf', 'rf', 'rw', 'lam', 'cam',
       'ram', 'lm', 'lcm', 'cm', 'rcm', 'rm', 'lwb', 'ldm', 'cdm', 'rdm',
       'rwb', 'lb', 'lcb', 'cb', 'rcb', 'rb']].head()
Out[35]:
ls st rs lw lf cf rf rw lam cam ram lm lcm cm rcm rm lwb ldm cdm rdm rwb lb lcb cb rcb rb
0 88+2 88+2 88+2 92+2 93+2 93+2 93+2 92+2 93+2 93+2 93+2 91+2 84+2 84+2 84+2 91+2 64+2 61+2 61+2 61+2 64+2 59+2 47+2 47+2 47+2 59+2
1 91+3 91+3 91+3 89+3 90+3 90+3 90+3 89+3 88+3 88+3 88+3 88+3 81+3 81+3 81+3 88+3 65+3 61+3 61+3 61+3 65+3 61+3 53+3 53+3 53+3 61+3
2 84+3 84+3 84+3 89+3 89+3 89+3 89+3 89+3 89+3 89+3 89+3 88+3 81+3 81+3 81+3 88+3 65+3 60+3 60+3 60+3 65+3 60+3 47+3 47+3 47+3 60+3
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 82+3 82+3 82+3 87+3 87+3 87+3 87+3 87+3 88+3 88+3 88+3 88+3 87+3 87+3 87+3 88+3 77+3 77+3 77+3 77+3 77+3 73+3 66+3 66+3 66+3 73+3
In [36]:
def ratingConverter(df_value):
    df_value = str(df_value)
    if "+" in df_value:
        v1 = df_value[0:2]
        v2 = df_value[-1]
        df_value = int(v1) + int(v2)
        return df_value
    else:
        return int(df_value)

atr_columns = ['ls', 'st', 'rs', 'lw', 'lf', 'cf', 'rf', 'rw', 'lam', 'cam',
       'ram', 'lm', 'lcm', 'cm', 'rcm', 'rm', 'lwb', 'ldm', 'cdm', 'rdm',
       'rwb', 'lb', 'lcb', 'cb', 'rcb', 'rb']

for col in atr_columns:
    df[col] = df[col].apply(ratingConverter)
In [37]:
df[['ls', 'st', 'rs', 'lw', 'lf', 'cf', 'rf', 'rw', 'lam', 'cam',
       'ram', 'lm', 'lcm', 'cm', 'rcm', 'rm', 'lwb', 'ldm', 'cdm', 'rdm',
       'rwb', 'lb', 'lcb', 'cb', 'rcb', 'rb']].head()
Out[37]:
ls st rs lw lf cf rf rw lam cam ram lm lcm cm rcm rm lwb ldm cdm rdm rwb lb lcb cb rcb rb
0 90 90 90 94 95 95 95 94 95 95 95 93 86 86 86 93 66 63 63 63 66 61 49 49 49 61
1 94 94 94 92 93 93 93 92 91 91 91 91 84 84 84 91 68 64 64 64 68 64 56 56 56 64
2 87 87 87 92 92 92 92 92 92 92 92 91 84 84 84 91 68 63 63 63 68 63 50 50 50 63
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 85 85 85 90 90 90 90 90 91 91 91 91 90 90 90 91 80 80 80 80 80 76 69 69 69 76
In [38]:
df['att_rate'] = (df['rf'] + df['st'] + df['lf'] + df['rs'] + df['ls'] + df['cf']\
                 + df['lw'] + df['rcm'] + df['lcm'] + df['cam'] + df['rm']\
                 + df['lam'] + df['lm'] + df['rw'] + df['cm'] + df['ram'] ) / 16

df['def_rate'] = (df['rcb'] + df['cb'] + df['lcb'] + df['lb'] + df['rb'] + df['rwb']\
                 + df['lwb']+ df['ldm']+ df['rdm'] + + df['cdm']) / 10


df.drop(columns=['rf', 'st', 'lw', 'rcm', 'lf', 'rs', 'rcb', 'lcm', 'cb', 'ldm', 'cam', 'cdm',
                     'ls', 'lcb', 'rm', 'lam', 'lm', 'lb', 'rdm', 'rw', 'cm', 'rb', 'ram', 'cf', 'rwb', 'lwb'
                    ], inplace=True)
In [39]:
df[['name','att_rate','def_rate']].head(4)
Out[39]:
name att_rate def_rate
0 L. Messi 92.0000 59.0
1 Cristiano Ronaldo 90.7500 62.4
2 Neymar Jr 89.4375 60.1
3 De Gea 0.0000 0.0



As seen below, some player body types are inconsistent and don't comply with the standard classification: ('Lean','Normal','Stocky')


In [40]:
df[~df['body_type'].isin(('Lean', 'Normal','Stocky'))][['name','body_type']]
Out[40]:
name body_type
0 L. Messi Messi
1 Cristiano Ronaldo C. Ronaldo
2 Neymar Jr Neymar
20 T. Courtois Courtois
27 M. Salah PLAYER_BODY_TYPE_25
371 X. Shaqiri Shaqiri
9567 A. Akinfenwa Akinfenwa

Let's change these values so that body_type is consistent.


In [41]:
df['body_type'][df['body_type'] == 'Messi'] = 'Lean'
df['body_type'][df['body_type'] == 'C. Ronaldo'] = 'Normal'
df['body_type'][df['body_type'] == 'Neymar'] = 'Lean'
df['body_type'][df['body_type'] == 'Courtois'] = 'Lean'
df['body_type'][df['body_type'] == 'PLAYER_BODY_TYPE_25'] = 'Normal' ##PLAYER_BODY_TYPE_25 corresponds to M.Salah
df['body_type'][df['body_type'] == 'Shaqiri'] = 'Stocky'
df['body_type'][df['body_type'] == 'Akinfenwa'] = 'Stocky'



To make the position field data easier to work with, we'll classify player positions into three categories F(attacking), M(midfield), D(defensive) and GK(goalkeeper)


In [42]:
df['position'].unique()
Out[42]:
array(['RF', 'ST', 'LW', 'GK', 'RCM', 'LF', 'RS', 'RCB', 'LCM', 'CB',
       'LDM', 'CAM', 'CDM', 'LS', 'LCB', 'RM', 'LAM', 'LM', 'LB', 'RDM',
       'RW', 'CM', 'RB', 'RAM', 'CF', 'RWB', 'LWB'], dtype=object)
In [43]:
def positionConverter(df_value):
    if df_value in ["RF","ST","LW","LF","RS","LS","RW","CF"]:
        df_value = 'F'
        return df_value
    elif df_value in ["RCM","LCM","LDM","CAM","CDM","RM","LAM","LM","RDM","CM","RAM"]:
        df_value = 'M'
        return df_value
    elif df_value in ["RCB","CB","LCB","LB","RB","RWB","LWB"]:
        df_value = 'D'
        return df_value
    else:
        return df_value
    
df['position'] = df['position'].apply(positionConverter)
In [44]:
df[['name','position']].sample(5)
Out[44]:
name position
10610 N. Brock-Madsen F
4354 K. Müller GK
14264 H. Charsley M
4344 J. Quiñones D
8392 O. Framelin GK



Exploratory Data Analysis and Visualization:


In [45]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as py
import plotly.graph_objs as go
from plotly import tools
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot


pd.options.display.float_format = '{:.2f}'.format

General Visualizations


What are the major nationalities in FIFA 19?

In [46]:
df_countries = df.groupby(by='nationality').size().reset_index()
df_countries.columns = ['country','count']
In [47]:
fig = go.Figure(data=go.Choropleth(
    locationmode = 'country names',
    locations = df_countries['country'],
    z = df_countries['count'],
    text = df_countries['country'],
    colorscale = 'Cividis',
    autocolorscale=False,
    marker_line_color='darkgray',
    marker_line_width=0.5,
    colorbar_title = 'Number of Players',
))

fig.update_layout(
    title_text='Player count Distribution per Country',
    geo=dict(
        showframe=False,
        showcoastlines=False,
        projection_type='equirectangular'
    )
)

fig.show(renderer='notebook')

Most players are from European and South American countries

Top 5 countries

  • England - 1657
  • Germany - 1195
  • Spain - 1071
  • Argentina - 936
  • France -911

Top 10 Most Valuable Clubs

In [48]:
df_club = df[['club','mkt_value']].groupby(['club'],as_index=False).sum()
df_club = df_club.sort_values('mkt_value',ascending=False).head(10)
In [49]:
fig = go.Figure(go.Bar(
    y=df_club['club'],
    x=df_club['mkt_value'],
    orientation='h'))

fig.update_layout(title_text='Total Value by Club')
fig.show(renderer='notebook')

No surprise that Real Madrid and FC Barcelona have the top spots as they have the likes of Messi and Ronaldo - the highest valued players in the game

Transfer Market Value vs Age vs Overall

In [50]:
fig = go.Figure(data=go.Scatter(
    y = df['mkt_value'],
    x=df['age'],
    mode='markers',
    marker = dict(color=df['overall'],colorbar=dict(title='Overall'))
))

fig.update_layout(title_text = "Value vs Age vs Overall")
fig.update_xaxes(title_text='Age(Years)')
fig.update_yaxes(title_text='Value (Euros)')


fig.show(renderer='notebook')

I

Feature Engineering:


Correlation Analysis

In [51]:
cat_col = df.select_dtypes(include=object).columns
num_col = df.select_dtypes(exclude=object).columns
print('Numerical features: ',num_col.values)
Numerical features:  ['age' 'overall' 'potential' 'mkt_value' 'wage' 'international_rep'
 'weak_foot' 'skill_moves' 'crossing' 'finishing' 'headingaccuracy'
 'shortpassing' 'volleys' 'dribbling' 'curve' 'fkaccuracy' 'longpassing'
 'ballcontrol' 'acceleration' 'sprintspeed' 'agility' 'reactions'
 'balance' 'shotpower' 'jumping' 'stamina' 'strength' 'longshots'
 'aggression' 'interceptions' 'positioning' 'vision' 'penalties'
 'composure' 'marking' 'standingtackle' 'slidingtackle' 'gkdiving'
 'gkhandling' 'gkkicking' 'gkpositioning' 'gkreflexes' 'on_loan'
 'months_at_club' 'height(cm)' 'weight(lbs)' 'att_rate' 'def_rate']

Here is a list of all numerical features from our table

'age', 'overall', 'potential', 'mkt_value', 'wage', 'international_rep', 'weak_foot', 'skill_moves', 'crossing', 'finishing', 'headingaccuracy', 'shortpassing', 'volleys', 'dribbling', 'curve', 'fkaccuracy', 'longpassing', 'ballcontrol', 'acceleration', 'sprintspeed', 'agility', 'reactions', 'balance', 'shotpower', 'jumping', 'stamina', 'strength', 'longshots', 'aggression', 'interceptions', 'positioning', 'vision', 'penalties', 'composure', 'marking', 'standingtackle', 'slidingtackle', 'gkdiving', 'gkhandling', 'gkkicking', 'gkpositioning', 'gkreflexes', 'on_loan', 'months_at_club', 'height(cm)', 'weight(lbs)', 'att_rate', 'def_rate'

To better determine correlation between player attributes and other features, we'll group the features into attacking, defensive and goalkeeping categories

In [52]:
df_1 = df.copy()
df_1['ovr_pace'] = df[['acceleration','sprintspeed']].mean(axis=1)


df_1['ovr_shooting'] = df[['positioning','finishing','fkaccuracy',\
                     'shotpower','longshots','volleys','penalties']].mean(axis=1)

df_1['ovr_passing'] = df[['vision','crossing','fkaccuracy',\
                    'shortpassing','longpassing','curve']].mean(axis=1)

df_1['ovr_dribbling'] = df[['agility','balance','reactions',\
                      'ballcontrol','dribbling','composure']].mean(axis=1)

df_1['ovr_defending'] = df[['interceptions','headingaccuracy',\
                      'marking','standingtackle','slidingtackle']].mean(axis=1)

df_1['ovr_physical'] = df[['jumping','stamina','strength','aggression']].mean(axis=1)

df_1['ovr_gk'] = df[['gkdiving','gkhandling','gkkicking','gkpositioning','gkreflexes']].mean(axis=1)

remove = ['acceleration','sprintspeed','positioning','finishing','fkaccuracy','shotpower','longshots','volleys','penalties',
        'vision','crossing','fkaccuracy','shortpassing','longpassing','curve', 'agility','balance','reactions','ballcontrol',\
        'dribbling','composure','interceptions','headingaccuracy','marking','standingtackle','slidingtackle','jumping',\
        'stamina','strength','aggression','gkdiving','gkhandling','gkkicking','gkpositioning','gkreflexes']

df_1.drop(columns=remove,inplace = True)
In [53]:
df_cat = df_1.select_dtypes(include=object).columns
df_num = df_1.select_dtypes(exclude=object).columns

print('Remaining numerical features in table: ', df_num.values)
Remaining numerical features in table:  ['age' 'overall' 'potential' 'mkt_value' 'wage' 'international_rep'
 'weak_foot' 'skill_moves' 'on_loan' 'months_at_club' 'height(cm)'
 'weight(lbs)' 'att_rate' 'def_rate' 'ovr_pace' 'ovr_shooting'
 'ovr_passing' 'ovr_dribbling' 'ovr_defending' 'ovr_physical' 'ovr_gk']

Remaining numerical features in table

age' 'overall' 'potential' 'mkt_value' 'wage' 'international_rep' 'weak_foot' 'skill_moves' 'on_loan' 'months_at_club' 'height(cm)' 'weight(lbs)' 'att_rate' 'def_rate' 'ovr_pace' 'ovr_shooting' 'ovr_passing' 'ovr_dribbling' 'ovr_defending' 'ovr_physical' 'ovr_gk'

In [54]:
corr = df_1.corr()
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(20, 9))

# Colourway
cmap = sns.diverging_palette(220, 10, as_cmap=True)


sns.heatmap(corr, mask=mask, cmap=cmap, center=0,
            square=True, linewidths=0.1, cbar_kws={"shrink": .5}, annot=True,annot_kws={'size':8},fmt='.2')
Out[54]:
<matplotlib.axes._subplots.AxesSubplot at 0x15dbc77a898>

Transfer Market Value vs Overall Rating

In [56]:
fig = go.Figure(data=go.Scatter(
    y = df_1['mkt_value'],
    x=df_1['overall'],
    mode='markers',
    marker = dict(color=df_1['wage'],colorbar=dict(title='Wage'))
))

fig.update_layout(title_text = "Overall vs Market Value vs Wage")
fig.update_xaxes(title_text='Overall')
fig.update_yaxes(title_text='Market Value')


fig.show(renderer='notebook')
In [57]:
value_corr = df_1.corr().abs()['mkt_value'].sort_values(ascending=True)
In [58]:
fig = go.Figure(data=go.Scatter(
    y = value_corr.values ,
    x=value_corr.index,
    mode='markers'
))

fig.update_layout(title_text = "Correlation with Label/Dependent Variable")
fig.update_xaxes(title_text='Field')
fig.update_yaxes(title_text='Correlation')


fig.show(renderer='notebook')
In [59]:
df.drop(columns=['international_rep','overall','wage'],inplace=True)

Transfer Market Tier Classification

In [60]:
df['mkt_value'].describe()
Out[60]:
count       17918.00
mean      2448628.75
std       5631804.21
min             0.00
25%        325000.00
50%        700000.00
75%       2100000.00
max     118500000.00
Name: mkt_value, dtype: float64
In [61]:
plt.figure(1, figsize=(18, 7))
ax = sns.countplot( x= 'mkt_value', data=df)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, fontsize=5)
plt.title('Market Value Distribution')
Out[61]:
Text(0.5, 1.0, 'Market Value Distribution')

--

Observations

mkt_value values are messy but we can still detect outliers having extremely high transfer market values.
These players can be grouped as 'Superstar' players, like Messi, Ronaldo, Hazard etc.
The distribution is multi modal
Most high rated players with an abnormal transfer market value seem to be causing a wide spread

Approach

Remove outliers and recompute the mean
That is, remove 90 percentile and above and recalculate mean value

In [62]:
realistic_value = df['mkt_value'][df['mkt_value'] < df['mkt_value'].quantile(0.9)]
realistic_value.describe()
Out[62]:
count     16051.00
mean    1094139.31
std     1288174.95
min           0.00
25%      290000.00
50%      600000.00
75%     1200000.00
max     6000000.00
Name: mkt_value, dtype: float64

This looks a bit better and well choose the mean as the threshold for low and high wage player

In [63]:
threshold = realistic_value.mean()

Now we'll create a modified dataframe with the required information


In [77]:
df.head()
Out[77]:
name age nationality potential club mkt_value preferred_foot weak_foot skill_moves work_rate body_type position crossing finishing headingaccuracy shortpassing volleys dribbling curve fkaccuracy longpassing ballcontrol acceleration sprintspeed agility reactions balance shotpower jumping stamina strength longshots aggression interceptions positioning vision penalties composure marking standingtackle slidingtackle gkdiving gkhandling gkkicking gkpositioning gkreflexes on_loan months_at_club height(cm) weight(lbs) att_rate def_rate
0 L. Messi 31 Argentina 94 FC Barcelona 110500000.00 Left 4.00 4.00 Medium/ Medium Lean F 84.00 95.00 70.00 90.00 86.00 97.00 93.00 94.00 87.00 96.00 91.00 86.00 91.00 95.00 95.00 85.00 68.00 72.00 59.00 94.00 48.00 22.00 94.00 94.00 75.00 96.00 33.00 28.00 26.00 6.00 11.00 15.00 14.00 8.00 0 183.00 170.18 159 92.00 59.00
1 Cristiano Ronaldo 33 Portugal 94 Juventus 77000000.00 Right 4.00 5.00 High/ Low Normal F 84.00 94.00 89.00 81.00 87.00 88.00 81.00 76.00 77.00 94.00 89.00 91.00 87.00 96.00 70.00 95.00 95.00 88.00 79.00 93.00 63.00 29.00 95.00 82.00 85.00 95.00 28.00 31.00 23.00 7.00 11.00 15.00 14.00 11.00 0 15.00 187.96 183 90.75 62.40
2 Neymar Jr 26 Brazil 93 Paris Saint-Germain 118500000.00 Right 5.00 5.00 High/ Medium Lean F 79.00 87.00 62.00 84.00 84.00 96.00 88.00 87.00 78.00 95.00 94.00 90.00 96.00 94.00 84.00 80.00 61.00 81.00 49.00 82.00 56.00 36.00 89.00 87.00 81.00 94.00 27.00 24.00 33.00 9.00 9.00 15.00 15.00 11.00 0 26.00 175.26 150 89.44 60.10
3 De Gea 27 Spain 93 Manchester United 72000000.00 Right 3.00 1.00 Medium/ Medium Lean GK 17.00 13.00 21.00 50.00 13.00 18.00 21.00 19.00 51.00 42.00 57.00 58.00 60.00 90.00 43.00 31.00 67.00 43.00 64.00 12.00 38.00 30.00 12.00 68.00 40.00 68.00 15.00 21.00 13.00 90.00 85.00 87.00 88.00 94.00 0 99.00 193.04 168 0.00 0.00
4 K. De Bruyne 27 Belgium 92 Manchester City 102000000.00 Right 5.00 4.00 High/ High Normal M 93.00 82.00 55.00 92.00 82.00 86.00 85.00 83.00 91.00 91.00 78.00 76.00 79.00 91.00 77.00 91.00 63.00 90.00 75.00 91.00 76.00 61.00 87.00 94.00 79.00 88.00 68.00 58.00 51.00 15.00 13.00 5.00 10.00 13.00 0 49.00 180.34 154 89.38 75.90
In [64]:
df1 = df.copy()
In [65]:
## setting binary values to low wage and high wage
df1['high_value'] = (df1['mkt_value'] > threshold).astype(int)
In [66]:
df1.drop(columns=['mkt_value','name','nationality','club'],inplace=True)
In [67]:
df1.head()
Out[67]:
age potential preferred_foot weak_foot skill_moves work_rate body_type position crossing finishing headingaccuracy shortpassing volleys dribbling curve fkaccuracy longpassing ballcontrol acceleration sprintspeed agility reactions balance shotpower jumping stamina strength longshots aggression interceptions positioning vision penalties composure marking standingtackle slidingtackle gkdiving gkhandling gkkicking gkpositioning gkreflexes on_loan months_at_club height(cm) weight(lbs) att_rate def_rate high_value
0 31 94 Left 4.00 4.00 Medium/ Medium Lean F 84.00 95.00 70.00 90.00 86.00 97.00 93.00 94.00 87.00 96.00 91.00 86.00 91.00 95.00 95.00 85.00 68.00 72.00 59.00 94.00 48.00 22.00 94.00 94.00 75.00 96.00 33.00 28.00 26.00 6.00 11.00 15.00 14.00 8.00 0 183.00 170.18 159 92.00 59.00 1
1 33 94 Right 4.00 5.00 High/ Low Normal F 84.00 94.00 89.00 81.00 87.00 88.00 81.00 76.00 77.00 94.00 89.00 91.00 87.00 96.00 70.00 95.00 95.00 88.00 79.00 93.00 63.00 29.00 95.00 82.00 85.00 95.00 28.00 31.00 23.00 7.00 11.00 15.00 14.00 11.00 0 15.00 187.96 183 90.75 62.40 1
2 26 93 Right 5.00 5.00 High/ Medium Lean F 79.00 87.00 62.00 84.00 84.00 96.00 88.00 87.00 78.00 95.00 94.00 90.00 96.00 94.00 84.00 80.00 61.00 81.00 49.00 82.00 56.00 36.00 89.00 87.00 81.00 94.00 27.00 24.00 33.00 9.00 9.00 15.00 15.00 11.00 0 26.00 175.26 150 89.44 60.10 1
3 27 93 Right 3.00 1.00 Medium/ Medium Lean GK 17.00 13.00 21.00 50.00 13.00 18.00 21.00 19.00 51.00 42.00 57.00 58.00 60.00 90.00 43.00 31.00 67.00 43.00 64.00 12.00 38.00 30.00 12.00 68.00 40.00 68.00 15.00 21.00 13.00 90.00 85.00 87.00 88.00 94.00 0 99.00 193.04 168 0.00 0.00 1
4 27 92 Right 5.00 4.00 High/ High Normal M 93.00 82.00 55.00 92.00 82.00 86.00 85.00 83.00 91.00 91.00 78.00 76.00 79.00 91.00 77.00 91.00 63.00 90.00 75.00 91.00 76.00 61.00 87.00 94.00 79.00 88.00 68.00 58.00 51.00 15.00 13.00 5.00 10.00 13.00 0 49.00 180.34 154 89.38 75.90 1

Creating dummy variables


In [68]:
#splitting player workrates into two columns
#from documentation, first value is attacking work rate and second is defensive work rate
df1['work_rate_a'] = df1['work_rate'].str.split('/',1,expand=True)[0]
df1['work_rate_d'] = df1['work_rate'].str.split('/',1,expand=True)[1]

df1.drop(columns=['work_rate'],inplace=True)
In [69]:
cat_vars = ['preferred_foot','body_type','position','work_rate_a','work_rate_d']
for var in cat_vars:
    cat_list = 'var'+'_'+var
    cat_list = pd.get_dummies(df1[var],prefix=var)
    df2 = df1.join(cat_list)
    df1=df2


cat_vars = ['preferred_foot','body_type','position','work_rate_a','work_rate_d']
data_vars=df1.columns.values.tolist()
to_keep=[i for i in data_vars if i not in cat_vars]

These are the remaining columns in the data set


In [70]:
df_final=df1[to_keep]
df_final.columns.values
Out[70]:
array(['age', 'potential', 'weak_foot', 'skill_moves', 'crossing',
       'finishing', 'headingaccuracy', 'shortpassing', 'volleys',
       'dribbling', 'curve', 'fkaccuracy', 'longpassing', 'ballcontrol',
       'acceleration', 'sprintspeed', 'agility', 'reactions', 'balance',
       'shotpower', 'jumping', 'stamina', 'strength', 'longshots',
       'aggression', 'interceptions', 'positioning', 'vision',
       'penalties', 'composure', 'marking', 'standingtackle',
       'slidingtackle', 'gkdiving', 'gkhandling', 'gkkicking',
       'gkpositioning', 'gkreflexes', 'on_loan', 'months_at_club',
       'height(cm)', 'weight(lbs)', 'att_rate', 'def_rate', 'high_value',
       'preferred_foot_Left', 'preferred_foot_Right', 'body_type_Lean',
       'body_type_Normal', 'body_type_Stocky', 'position_D', 'position_F',
       'position_GK', 'position_M', 'work_rate_a_High', 'work_rate_a_Low',
       'work_rate_a_Medium', 'work_rate_d_ High', 'work_rate_d_ Low',
       'work_rate_d_ Medium'], dtype=object)

Over-sampling using SMOTE

The threshold for high market value players was not determined by the 50 percentile of the value data and instead lowered to better split the players This caused an under representation of the high market value players as seen in the bar chart below.

We'll up-sample the high-value players using the SMOTE algorithm(Synthetic Minority Oversampling Technique).

In [72]:
plt.figure(1, figsize=(18, 7))
ax = sns.countplot( x= 'high_value', data=df1)
plt.title('Count of High Market Value Players vs Low Market Value Players')
Out[72]:
Text(0.5, 1.0, 'Count of High Market Value Players vs Low Market Value Players')
In [71]:
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
In [73]:
X = df_final.loc[:, df_final.columns != 'high_value']
y = df_final.loc[:, df_final.columns == 'high_value']

from imblearn.over_sampling import SMOTE

os = SMOTE(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
columns = X_train.columns

os_data_X,os_data_y=os.fit_sample(X_train, y_train)
os_data_X = pd.DataFrame(data=os_data_X,columns=columns )
os_data_y= pd.DataFrame(data=os_data_y,columns=['high_value'])
#we can Check the numbers of our data
print("length of oversampled data is ",len(os_data_X))
print("Number of no subscription in oversampled data",len(os_data_y[os_data_y['high_value']==0]))
print("Number of subscription",len(os_data_y[os_data_y['high_value']==1]))
print("Proportion of no subscription data in oversampled data is ",len(os_data_y[os_data_y['high_value']==0])/len(os_data_X))
print("Proportion of subscription data in oversampled data is ",len(os_data_y[os_data_y['high_value']==1])/len(os_data_X))
length of oversampled data is  15844
Number of no subscription in oversampled data 7922
Number of subscription 7922
Proportion of no subscription data in oversampled data is  0.5
Proportion of subscription data in oversampled data is  0.5

Length of oversampled data is 15888

Number of no subscription in oversampled data 7944

Number of subscription 7944

Proportion of no subscription data in oversampled data is 0.5

Proportion of subscription data in oversampled data is 0.5


We now have a perfectly balanced dataset!



Recursive Feature Elimination


Recursive Feature Elimination (RFE) is based on the idea to repeatedly construct a model and choose either the best or worst performing feature, setting the feature aside and then repeating the process with the rest of the features. This process is applied until all features in the dataset are exhausted. The goal of RFE is to select features by recursively considering smaller and smaller sets of features.


In [74]:
df_final_vars=df_final.columns.values.tolist()
y=['high_value']
X=[i for i in df_final_vars if i not in y]

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
rfe = RFE(logreg, 10)
rfe = rfe.fit(os_data_X, os_data_y.values.ravel())
print(rfe.support_)
print(rfe.ranking_)
 [False False False  True False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False  True  True False False
 False  True  True False  True  True False  True False  True  True]
[10  7 45  1 38 28 17 18 49 21 41 50 39 13 46 19 47  8 14 25 40 32 12 44
 48 27 31 43 34 15 37 36 24 30 20 33 22 16  9 42 11 35 26 23  1  1  4  5
  6  1  1 29  1  1  2  1  3  1  1]
In [75]:
index_list = []
for i,t in enumerate(rfe.support_):
    if(t==True):
        index_list.append(i)
cols_list = df_final.columns.values

cols_list = np.delete(cols_list,np.where(cols_list == 'high_value'))
cols = []
for i in index_list:
     cols.append(cols_list[i])

cols
Out[75]:
['skill_moves',
 'preferred_foot_Left',
 'preferred_foot_Right',
 'position_D',
 'position_F',
 'position_M',
 'work_rate_a_High',
 'work_rate_a_Medium',
 'work_rate_d_ Low',
 'work_rate_d_ Medium']
In [76]:
X=os_data_X[cols]
y=os_data_y['high_value']

The RFE has helped us select the following features:

'skill_moves', 'preferred_foot_Left', 'preferred_foot_Right', 'position_D', 'position_F', 'position_GK', 'position_M', 'work_rate_a_Medium', 'work_rate_d_ Low', 'work_rate_d_ Medium'



Feature Reduction


In [77]:
import statsmodels.api as sm
df_final_vars=df_final.columns.values.tolist()

logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary2())
Optimization terminated successfully.
         Current function value: 0.559702
         Iterations 6
                           Results: Logit
=====================================================================
Model:                Logit             Pseudo R-squared:  0.193     
Dependent Variable:   high_value        AIC:               17755.8218
Date:                 2019-10-10 09:53  BIC:               17832.5273
No. Observations:     15844             Log-Likelihood:    -8867.9   
Df Model:             9                 LL-Null:           -10982.   
Df Residuals:         15834             LLR p-value:       0.0000    
Converged:            1.0000            Scale:             1.0000    
No. Iterations:       6.0000                                         
---------------------------------------------------------------------
                      Coef.  Std.Err.    z     P>|z|   [0.025  0.975]
---------------------------------------------------------------------
skill_moves           2.2643   0.0462  48.9997 0.0000  2.1737  2.3549
preferred_foot_Left  -2.0130   0.1219 -16.5097 0.0000 -2.2520 -1.7740
preferred_foot_Right -1.8499   0.1106 -16.7225 0.0000 -2.0667 -1.6331
position_D           -2.5340   0.0841 -30.1470 0.0000 -2.6988 -2.3693
position_F           -3.2343   0.1050 -30.8102 0.0000 -3.4401 -3.0286
position_M           -3.4622   0.1023 -33.8414 0.0000 -3.6627 -3.2617
work_rate_a_High     -0.0663   0.0934  -0.7099 0.4778 -0.2494  0.1168
work_rate_a_Medium   -0.1565   0.0868  -1.8035 0.0713 -0.3265  0.0136
work_rate_d_ Low     -0.7948   0.0837  -9.4953 0.0000 -0.9588 -0.6307
work_rate_d_ Medium  -0.8619   0.0521 -16.5434 0.0000 -0.9640 -0.7598
=====================================================================

The p-values for most of the variables are smaller than 0.05, except 1 variables, therefore, we will remove it.

In [78]:
cols2 = ['work_rate_a_High']
cols = [x for x in cols if x not in cols2]
X = os_data_X[cols]
y=os_data_y['high_value']

logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary2())
Optimization terminated successfully.
         Current function value: 0.559717
         Iterations 6
                           Results: Logit
=====================================================================
Model:                Logit             Pseudo R-squared:  0.192     
Dependent Variable:   high_value        AIC:               17754.3249
Date:                 2019-10-10 09:53  BIC:               17823.3598
No. Observations:     15844             Log-Likelihood:    -8868.2   
Df Model:             8                 LL-Null:           -10982.   
Df Residuals:         15835             LLR p-value:       0.0000    
Converged:            1.0000            Scale:             1.0000    
No. Iterations:       6.0000                                         
---------------------------------------------------------------------
                      Coef.  Std.Err.    z     P>|z|   [0.025  0.975]
---------------------------------------------------------------------
skill_moves           2.2585   0.0454  49.6990 0.0000  2.1694  2.3475
preferred_foot_Left  -2.0579   0.1042 -19.7440 0.0000 -2.2622 -1.8537
preferred_foot_Right -1.8931   0.0924 -20.4903 0.0000 -2.0742 -1.7120
position_D           -2.5237   0.0828 -30.4961 0.0000 -2.6859 -2.3615
position_F           -3.2300   0.1048 -30.8259 0.0000 -3.4354 -3.0247
position_M           -3.4548   0.1018 -33.9530 0.0000 -3.6542 -3.2554
work_rate_a_Medium   -0.1030   0.0432  -2.3833 0.0172 -0.1877 -0.0183
work_rate_d_ Low     -0.7984   0.0835  -9.5568 0.0000 -0.9622 -0.6347
work_rate_d_ Medium  -0.8661   0.0518 -16.7261 0.0000 -0.9676 -0.7646
=====================================================================

In [ ]:
 


Model Development:


In [79]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Out[79]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
In [80]:
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))
Accuracy of logistic regression classifier on test set: 0.70

Accuracy of logistic regression classifier on test set: 0.70


Confusion Matrix

In [81]:
from sklearn.metrics import confusion_matrix
confusion_matrix = confusion_matrix(y_test, y_pred)
print(confusion_matrix)
[[1624  796]
 [ 607 1727]]
In [82]:
ax = plt.subplot()
sns.heatmap(confusion_matrix,annot=True,ax=ax,cmap="Blues",fmt='g',xticklabels=['Positive','Negative'], yticklabels=['Positive','Negative'])
Out[82]:
<matplotlib.axes._subplots.AxesSubplot at 0x2022f5b6e80>

The result is telling us that we have [1624 + 1727] correct predictions and [796 + 607] incorrect predictions.



Model Evaluation:


Let's evaluate the model using model evaluation metrics such as accuracy and precision.

In [83]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.73      0.67      0.70      2420
           1       0.68      0.74      0.71      2334

    accuracy                           0.70      4754
   macro avg       0.71      0.71      0.70      4754
weighted avg       0.71      0.70      0.70      4754


Interpretation:

When our Logistic Regression model predicted players having a high market value, those player had a high value 70% of the time.



ROC Curve


In [84]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
logit_roc_auc = roc_auc_score(y_test, logreg.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--') # ROC curve of a purely random classifier
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

The dotted line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner).

AUC score for the case is 0.71. AUC score 1 represents perfect classifier, and 0.5 represents a worthless classifier


Improvements

  • Investigate multicollinearity in model. Ensure that independent variables are independent of each other
  • Gather more training data
  • Dimensionality reduction
  • More model testing and repeating test. train loop